Iterative deep neural networks based on proximal gradient descent for image restoration

Algorithm unfolding networks, which combine the explainability of iterative algorithms with the efficiency of Deep Neural Networks (DNN), have received considerable attention for solving ill-posed inverse problems. Within the algorithm unfolding framework, we propose a novel end-to-end iterative deep neural network and a fast variant of it for image restoration. The first network is designed from the proximal gradient descent algorithm for variational models and consists of denoiser and reconstruction sub-networks. The second is its accelerated version with momentum factors. For the denoiser sub-network, we embed the Convolutional Block Attention Module (CBAM) in the existing U-Net for adaptive feature refinement. Experiments on image denoising and deblurring demonstrate that the proposed networks achieve competitive quality and efficiency compared with several state-of-the-art networks for image restoration. The proposed unfolding DNN can be easily extended to other similar image restoration tasks, such as image super-resolution and image demosaicking.

Introduction
Model-based methods minimize energy functions as tools to solve linear inverse problems. The variational model is expressed as

$$\hat{x} = \arg\min_{x} D(x, y) + \lambda R(x), \qquad (1)$$

where y denotes the degraded input and x the reconstructed output. D denotes a data fidelity term, which guarantees that the solution of image restoration accords with the degradation process. R is a prior (regularization) term, weighted by a regularization parameter λ, that ensures image features. Different image tasks are handled flexibly by simply integrating different degradation operations (noise level, blur kernel, and downsampling factor) into the equations. However, model-based methods lack an intuitive evaluation. Another approach is deep learning with a pre-learned function $\mathcal{F}(y; \Theta)$, where Θ denotes trainable parameters. Data-driven approaches tend to enjoy better performance. However, learning-based methods suffer from black-box properties and are limited to specified tasks.

Both categories of methods have advantages and disadvantages. It has therefore become attractive to integrate them and combine their respective merits, an approach dubbed unrolling iterative methods. One such integration is the Plug-and-Play (PnP) method, which replaces proximal operators with a learning-based denoiser prior. The splitting algorithms of PnP methods split an energy function into multiple stand-alone sub-problems. Zhang et al. [13] used Half Quadratic Splitting (HQS) to split a problem into a data recovery term and a feature expression term. The data recovery sub-problem has an analytical solution that is computed with the Fast Fourier Transform (FFT), and a denoiser handles the feature expression sub-problem. Lei et al. [14] inserted Deep Convolutional Neural Networks (DCNN) into Split Bregman (SB) methods. Chan et al. [15] proved that the plug-and-play Alternating Direction Method of Multipliers (PnP-ADMM) converges to a fixed point for any denoising algorithm satisfying an asymptotic criterion.
Methods without splitting algorithms open a new door to integrating degradation operations into the equations. Al-Shabili et al. [16] used Bregman Proximal Gradient Methods with PnP (PnP-BPGM) to avoid splitting algorithms when solving Poisson inverse problems. Gavaskar et al. [17] proposed that the plug-and-play Fast Iterative Shrinkage/Thresholding Algorithm (PnP-FISTA) can be realized by virtue of asymmetric denoisers. Nair et al. [18] analyzed the PnP convergence of the Iterative Shrinkage/Thresholding Algorithm (ISTA) using asymmetric denoisers. Although PnP approaches achieve superior performance through pre-training, several conceptual problems remain to be addressed. First, hand-crafted parameter adjustment significantly increases the time cost. Second, fixed parameters ignore the dynamic characteristics of model optimization: constant parameters cannot represent the dynamic process of searching for a better solution. Third, it is difficult to know which parameters are optimal. Finally, the quality of image reconstruction is strongly affected by fluctuations of the parameters.
To address the above drawbacks, we advocate an end-to-end training structure with trainable parameters to unroll iterative algorithms. It not only infers high-quality images or missing high-frequency information from a large number of degraded images, but also learns to adjust the given parameters automatically. Dong et al. [19] used deep unfolding networks to make up for the insufficiency of parameter tuning. Liu et al. [20] unrolled ADMM into a proximal alternating direction network and used dynamic parameters to guarantee at least fixed-point convergence when dealing with unknown and intractable regularization terms. Yang et al. [21] showed that unrolled ADMM networks learn discriminatively from training data instead of relying on the hand-set hyperparameters of traditional compressive sensing methods. To remove the manual tweaking of PnP methods, Wei et al. [22] proposed a network that tunes internal parameters automatically, which yields a tuning-free PnP proximal algorithm. Learning the parameters in this way greatly reduces the cost of manual tuning.
The contributions of this work are outlined below:
• The proximal gradient descent algorithm is unfolded into a novel and simple Iterative Deep Neural Network (IDNN) with a U-Net denoiser. An attention mechanism incorporated into the denoiser effectively learns which image information needs to be emphasized or suppressed.
• An improved Fast Iterative Deep Neural Network (FIDNN) is proposed based on parameter constraints and a momentum factor. Compared with other iteration-based methods of the same kind, it converges faster and has a shorter testing runtime without requiring stronger criteria.

Related works
PnP approaches have the benefit of being highly convenient, and deep unfolding networks better control the time cost of parameter adjustment. We provide a brief review of these two methods, both of which build on effective DCNN denoisers.

Plug-and-Play method
PnP methods have recently made significant empirical progress, particularly with the incorporation of learning-based denoisers. Moreover, Convolutional Neural Networks (CNN) have shown good performance through end-to-end training, e.g., FFDNet [10], TNRD [11], and DnCNN [23] for image denoising, and DPDNN [19] and IRCNN [24] for non-blind deblurring. These methods demonstrate that a CNN can learn an excellent mapping function from a large number of degraded images to clean images. As a result, PnP approaches can make use of a pre-trained CNN denoiser to solve the Gaussian-like denoising subproblem of

$$x = \arg\min_{x} \frac{1}{2}\|y - Hx\|_2^2 + \lambda\Phi(x), \qquad (2)$$

where λ is a penalty parameter. PnP methods use variable splitting algorithms, such as HQS and SB, to decouple the data term and the prior term of Eq (2). When HQS introduces an auxiliary variable s, Eq (2) becomes a constrained optimization problem given by

$$(x, s) = \arg\min_{x, s} \frac{1}{2}\|y - Hx\|_2^2 + \lambda\Phi(s) \quad \text{s.t.} \quad s = x. \qquad (3)$$

The equality-constrained problem is transformed into an unconstrained problem, namely

$$(x, s) = \arg\min_{x, s} \frac{1}{2}\|y - Hx\|_2^2 + \lambda\Phi(s) + \frac{\mu}{2}\|s - x\|_2^2, \qquad (4)$$

where μ denotes a penalty parameter. This problem can be addressed by iteratively solving the following subproblems for x and s while holding the remaining variable fixed,

$$\begin{cases} x^{k} = \arg\min_{x} \|y - Hx\|_2^2 + \mu\|x - s^{k-1}\|_2^2, & (5a) \\ s^{k} = \arg\min_{s} \dfrac{1}{2(\sqrt{\lambda/\mu})^2}\|s - x^{k}\|_2^2 + \Phi(s). & (5b) \end{cases}$$

In this paper, k is the iteration index. Eq (5a) has the closed-form analytic solution

$$x^{k} = \mathcal{F}^{-1}\left(\frac{\overline{\mathcal{F}(H)}\,\mathcal{F}(y) + \mu\,\mathcal{F}(s^{k-1})}{\overline{\mathcal{F}(H)}\,\mathcal{F}(H) + \mu}\right),$$

where $\mathcal{F}(\cdot)$, $\mathcal{F}^{-1}(\cdot)$, and $\overline{\mathcal{F}(H)}$ express the FFT, the inverse FFT, and the complex conjugate of $\mathcal{F}(H)$, respectively. Gradient descent can also solve the x-subproblem of Eq (5a) [19]. Any advanced Gaussian denoiser can be plugged into the alternating iterations to solve the s-subproblem of Eq (5b). Therefore, numerous ill-posed inverse problems are quickly addressed using PnP approaches.

Algorithm 1 Two-step iterative algorithm
Initialization: $x^{0} = y$
While $k \le K$ do
    $z^{k} = x^{k-1} - \gamma \bar{H}(Hx^{k-1} - y)$;
    $x^{k} = f(z^{k}, \sqrt{\lambda/\mu_k})$;
End while
Output: $x^{k}$

Deep unfolding network
Deep unfolding networks enhance the interpretability of network structures in contrast to pure neural networks. Chen and Pock [11] proposed a flexible framework with a dynamic nonlinear diffusion model for denoising tasks. Zhang and Ghanem [25] realized the proximal mapping associated with a sparsity-inducing regularizer without hand-crafted parameter adjustment. Tolooshams et al. [26] used an unfolded autoencoder neural network with an accelerated proximal gradient to learn the compression matrix. Based on prior knowledge, the stationary layers of model-based iterative networks are interpreted as convolution and activation operations.
DCNN denoisers can be plugged into end-to-end deep unfolding networks to obtain self-learning parameters. Wei et al. [22] achieved automatic parameter learning with proximal algorithms. Zheng et al. [27] used Hybrid ISTA to unfold ISTA with trainable parameters, incorporating free-form Deep Neural Networks (DNN) while retaining guaranteed convergence. Jiu and Pustelnik [28] designed a deep neural network from primal-dual proximal iterations associated with standard penalized co-log-likelihood minimization. Iteration-based unfolding networks thus combine the effectiveness of machine learning with the adaptability of formula derivation.

Two-step iterative algorithm
Since deep unfolding networks are well studied, it is interesting to integrate different degradation operations into one iterative algorithm, so that different image restoration problems can be solved in a uniform way. To achieve this, a proximal operator is used to implement the proximal gradient descent algorithm without splitting algorithms. The Taylor expansion linearization [29] is calculated as

$$x^{k} = \arg\min_{x} \frac{1}{2}\|Hx^{k-1} - y\|_2^2 + \langle \bar{H}(Hx^{k-1} - y), (x - x^{k-1}) \rangle + \frac{\mu}{2}\|(x - x^{k-1})\|_2^2 + \lambda\Phi(x), \qquad (6)$$

where μ denotes the penalty parameter, $\|(x - x^{k-1})\|_2^2$ denotes the proximal term, y denotes the degraded input, and x denotes the restored output. For image deblurring, $\bar{H}$ is a transpose convolution matrix. By omitting the data term $\frac{1}{2}\|Hx^{k-1} - y\|_2^2$, which does not depend on x and is therefore irrelevant to the result, Eq (6) is merged into

$$x^{k} = \arg\min_{x} \frac{\mu}{2}\left\|x - \left(x^{k-1} - \frac{1}{\mu}\bar{H}(Hx^{k-1} - y)\right)\right\|_2^2 + \lambda\Phi(x).$$

For convenience of calculation, an auxiliary variable z is introduced to replace the lengthy inner expression. The variable z is equal to

$$z^{k} = x^{k-1} - \gamma \bar{H}(Hx^{k-1} - y),$$

where γ is the step size, which plays the role of 1/μ in the merged problem. Therefore, the solution can be expressed as

$$x^{k} = \arg\min_{x} \frac{1}{2(\sqrt{\lambda/\mu})^2}\|x - z^{k}\|_2^2 + \Phi(x).$$

This is a Gaussian denoising problem with a standard deviation parameter $\sigma_k = \sqrt{\lambda/\mu_k}$. Clean images are obtained with any existing DCNN denoiser, i.e., $x^{k} = f(z^{k})$, where $f(\cdot)$ denotes a high-performing denoiser that approximates the mapping. The proposed iterative algorithm is summarized in Algorithm 1. The two-step algorithm is unfolded into an end-to-end neural network based on DCNN denoisers.
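The two steps of this iteration (a gradient step on the data term, followed by a denoiser) can be illustrated with a minimal pure-Python sketch on 1-D signals. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the circular 3-tap blur standing in for H, the neighbour-averaging filter `toy_denoiser` standing in for the DCNN denoiser f, and the names `blur`, `two_step`, and `data_error`.

```python
# Minimal sketch of the two-step iteration on 1-D signals (pure Python).
# Assumptions, not taken from the paper: a circular 3-tap blur plays the role
# of H, and a neighbour-averaging filter stands in for the DCNN denoiser f.

KERNEL = (0.25, 0.5, 0.25)  # symmetric, so the transpose of H equals H


def blur(x):
    """Apply H: circular convolution of x with KERNEL."""
    n = len(x)
    return [KERNEL[0] * x[(i - 1) % n] + KERNEL[1] * x[i] + KERNEL[2] * x[(i + 1) % n]
            for i in range(n)]


def toy_denoiser(z, strength=0.1):
    """Hypothetical stand-in for f: blend each sample with its neighbours' mean."""
    n = len(z)
    return [(1 - strength) * z[i] + strength * 0.5 * (z[(i - 1) % n] + z[(i + 1) % n])
            for i in range(n)]


def identity(z):
    """Trivial denoiser, useful for checking the gradient step in isolation."""
    return list(z)


def two_step(y, gamma=1.0, num_iters=6, denoiser=toy_denoiser):
    x = list(y)  # x^0 = y
    for _ in range(num_iters):
        residual = [hx - yi for hx, yi in zip(blur(x), y)]  # H x^{k-1} - y
        grad = blur(residual)                               # transpose of H applied to the residual
        z = [xi - gamma * gi for xi, gi in zip(x, grad)]    # step 1: z^k
        x = denoiser(z)                                     # step 2: x^k = f(z^k)
    return x


def data_error(x, y):
    """Squared data-fidelity error ||Hx - y||^2."""
    return sum((hx - yi) ** 2 for hx, yi in zip(blur(x), y))


clean = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
y = blur(clean)          # noise-free blurred observation
restored = two_step(y)   # six iterations, matching K = 6
```

With the `identity` denoiser the loop reduces to plain gradient descent on the data term, which makes it easy to check that the gradient step alone lowers the data-fidelity error before a real denoiser is plugged in.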
3.2 Three-step iterative algorithm
3.2.1 Fast iterative algorithm.
Fast algorithms, e.g., Fast ADMM [30] and FISTA [31], show that momentum factors accelerate convergence. In this paper, we therefore adopt momentum factors to speed up convergence. Based on Algorithm 1, a momentum factor ρ is introduced so that the variable x keeps moving with a similar inertia. The momentum factor lies between 0 and 1. The variable x is updated by adding the difference between the two previous iterates multiplied by the momentum factor, i.e.,

$$b^{k} = x^{k-1} + \rho\,(x^{k-1} - x^{k-2}).$$

The new auxiliary variable $\bar{z}$ of the accelerated method changes due to the momentum factor ρ. The auxiliary variable becomes

$$\bar{z}^{k} = b^{k} - \gamma \bar{H}(Hb^{k} - y),$$

where b and $\bar{z}$ are intermediate variables of the final results and γ represents the step size. The fast iterative algorithm is summarized as Algorithm 2. Inspired by IDNN, Algorithm 2 can also be unfolded into a fast iterative deep neural network. The proposed fast method accelerates convergence effectively; a detailed description is given in Section 5.4. Moreover, FIDNN has a shorter testing runtime than IDNN.

Algorithm 2 Three-step iterative algorithm
Initialization: $x^{0} = y$
While $k \le K$ do
    $b^{k} = x^{k-1} + \rho\,(x^{k-1} - x^{k-2})$;
    $\bar{z}^{k} = b^{k} - \gamma \bar{H}(Hb^{k} - y)$;
    $x^{k} = f(\bar{z}^{k}, \sqrt{\lambda/\mu_k})$;
End while
Output: $x^{k}$
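The momentum-accelerated iteration can be sketched in the same spirit. The code below is a hedged illustration only: the circular 3-tap blur standing in for H, the caller-supplied denoiser, the initialization of both previous iterates to y, and the names `extrapolate` and `three_step` are all assumptions introduced here, not details from the paper.

```python
# Sketch of the three-step (momentum) iteration on 1-D signals (pure Python).
# Assumptions, not from the paper: H is a circular 3-tap blur, the denoiser is
# supplied by the caller, and both previous iterates start at y.

KERNEL = (0.25, 0.5, 0.25)  # symmetric, so the transpose of H equals H


def blur(x):
    """Apply H: circular convolution of x with KERNEL."""
    n = len(x)
    return [KERNEL[0] * x[(i - 1) % n] + KERNEL[1] * x[i] + KERNEL[2] * x[(i + 1) % n]
            for i in range(n)]


def extrapolate(x, x_prev, rho):
    """Momentum step: b = x + rho * (x - x_prev)."""
    return [xi + rho * (xi - pi) for xi, pi in zip(x, x_prev)]


def three_step(y, denoiser, gamma=1.0, rho=0.5, num_iters=6):
    x_prev, x = list(y), list(y)
    for _ in range(num_iters):
        b = extrapolate(x, x_prev, rho)                        # step 1: momentum
        residual = [hb - yi for hb, yi in zip(blur(b), y)]     # H b^k - y
        grad = blur(residual)                                  # transpose of H applied to the residual
        z_bar = [bi - gamma * gi for bi, gi in zip(b, grad)]   # step 2: gradient step taken at b^k
        x_prev, x = x, denoiser(z_bar)                         # step 3: denoise
    return x


observation = blur([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])
estimate = three_step(observation, denoiser=list)  # `list` acts as an identity denoiser
```

Setting `rho=0` makes `b` equal to the previous iterate, so the loop falls back to the two-step scheme; this is one simple way to compare the two variants under identical settings.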

Parameter constraint.
Parameters such as the step size and the momentum factor are likely to affect the reconstructed images. We found that FIDNN might learn non-positive step sizes and momentum factors, which conflicts with how these parameters are defined. To ensure convergence with positive parameters, the parameters $\{\gamma_k, \rho_k\}_{k=1}^{K=6}$ must be subject to specific constraints. Parameter constraints [32] are enforced using auxiliary variables. In our implementation, the parameters follow a pattern in which γ decays smoothly with the iterations, while ρ increases monotonically. With the above rules, the parameter constraints are built on the Softplus function sp(x), i.e., sp(x) = ln(1 + exp(x)). In this way, the image restoration process is guaranteed to remain consistent with the meaning of model-based iterative solutions.
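To make the role of Softplus concrete, the sketch below shows one possible way to obtain a positive, decaying step size and a momentum factor that stays in [0, 1) and increases with the iteration index. The affine forms, the weights `w` and `c`, and the function names are hypothetical choices for illustration; they are not the constraint formula used in this paper.

```python
import math


def softplus(x):
    """sp(x) = ln(1 + exp(x)), evaluated in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def step_size(k, w=-0.5, c=0.0):
    """Hypothetical schedule: always positive, and decreasing in k because w < 0."""
    return softplus(w * k + c)


def momentum(k, w=0.5, c=0.0):
    """Hypothetical schedule: 0 at k = 1, increasing in k, and below 1 because w > 0."""
    return (softplus(w * k + c) - softplus(w + c)) / softplus(w * k + c)


gammas = [step_size(k) for k in range(1, 7)]  # K = 6 stages
rhos = [momentum(k) for k in range(1, 7)]
```

In a trainable network, `w` and `c` would be the learnable auxiliary variables, and the sign restrictions on `w` are what turn the Softplus outputs into the decaying and increasing patterns described above.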

Deep unfolding network framework
Algorithm 1 and Algorithm 2 are unrolled into end-to-end iterative deep neural networks without numerous manual parameters. The network framework of Algorithm 1 is shown in Fig 1; the framework of Algorithm 2 has a similar structure. One stage of the proposed networks corresponds to one iteration of Algorithm 1. For K iterations, we briefly describe the forward propagation of the first stage. First, the variable $y \in \mathbb{R}^{n_y}$ is set to the degraded input, and the variables $x^{0}$ and $z^{0}$ are initialized to y. Next, the degradation operations are applied to $x^{0}$ and y, and $x^{0}$ is added to the result to obtain $z^{1}$. Then $z^{1}$ is processed by any efficient DCNN denoiser to get $x^{1}$. The denoiser in this paper is a high-performance U-Net [33]. The same procedure is carried out six times.

Deep convolutional neural network
Pre-trained DCNN models are attractive as denoisers. Zhang et al. [13] used noise level maps as inputs to train a denoiser for image restoration tasks. Tirer and Giryes [34] used the IRCNN denoiser to solve image inpainting and deblurring problems. Li and Wu [35] exploited the DnCNN denoiser for depth image tasks. Romano et al. [36] used explicit regularization with a pre-trained TNRD as the Gaussian denoiser to solve deblurring and super-resolution problems. Motivated by the U-Net for image segmentation, the proposed U-Net uses only convolutional and activation operations, so it can process natural images of any size. Different from the denoiser sub-network in [19], our proposed methods introduce the attention mechanism.
In the feature extraction module, there are four similar blocks. Each encoder layer consists of a convolution with a 3×3 kernel followed by a Rectified Linear Unit (ReLU) nonlinearity and produces 64-channel feature maps. Each down-sample layer contains a convolution operation followed by an activation function. Down-sample layers reduce the spatial resolution of the feature maps, which increases the receptive field. Finally, there is an encoder layer with only a convolution and an activation operation. Note that feature maps are downscaled by a factor of 2 in each down-sample layer, whereas the feature map size is unchanged in the encoder layers.
CBAM is a lightweight, general-purpose module, so it can be seamlessly integrated into any CNN architecture and trained end-to-end together with the base CNN. Attention maps are obtained by sequentially computing attention along two independent dimensions, namely channel and space. The input feature maps are multiplied by the attention maps to obtain adaptive feature refinement. Channel attention exploits the relationship between feature channels to produce a channel attention map, focusing on "what" is meaningful in a given input. Spatial attention exploits the spatial relationship of image features to generate a spatial attention map, focusing on "where" the informative elements are. Spatial attention is complementary to channel attention. The attention module effectively boosts the information flow by learning which image information to emphasize or suppress. Comparative experiments on deblurring and denoising demonstrate the benefits of the attention mechanism, as indicated in Section 5.2.
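The sequential channel-then-spatial refinement can be illustrated with a small pure-Python sketch on nested lists. It is a simplified illustration under stated assumptions rather than the module used in the paper: the weights `w1`, `w2`, `w_avg`, and `w_max` are hypothetical, and the spatial convolution over the pooled maps is replaced by two scalar mixing weights to keep the example short.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def shared_mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU, shared by the average- and max-pooled descriptors."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feat, w1, w2):
    """feat: list of C channels, each an H x W nested list. Returns C weights in (0, 1)."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]
    return [sigmoid(a + m)
            for a, m in zip(shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2))]


def spatial_attention(feat, w_avg, w_max):
    """Per-pixel weights from the channel-wise mean and max.

    Simplification: scalar mixing weights replace the spatial convolution.
    """
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    att = []
    for i in range(height):
        row = []
        for j in range(width):
            vals = [feat[c][i][j] for c in range(channels)]
            row.append(sigmoid(w_avg * sum(vals) / channels + w_max * max(vals)))
        att.append(row)
    return att


def cbam(feat, w1, w2, w_avg=1.0, w_max=1.0):
    """Channel attention first, then spatial attention on the channel-refined features."""
    ca = channel_attention(feat, w1, w2)
    refined = [[[ca[c] * v for v in row] for row in ch] for c, ch in enumerate(feat)]
    sa = spatial_attention(refined, w_avg, w_max)
    return [[[sa[i][j] * v for j, v in enumerate(row)] for i, row in enumerate(ch)]
            for ch in refined]


features = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]  # C = 2, H = W = 2
w1 = [[0.5, -0.5]]      # hidden layer: 1 unit, C inputs (hypothetical weights)
w2 = [[1.0], [-1.0]]    # output layer: C units, 1 input each (hypothetical weights)
refined_features = cbam(features, w1, w2)
```

Because both attention maps pass through a sigmoid, every refined value is a scaled-down copy of the corresponding input, which is the sense in which the module emphasizes or suppresses information.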
The image reconstruction module comprises up-sample layers, which increase the spatial resolution of feature maps, followed by feature decoder layers. Each up-sample layer contains a transpose convolution with a 3 × 3 kernel and a ReLU nonlinearity and produces 64-channel feature maps. Feature maps are upscaled by a factor of 2 in each up-sample layer. Some spatial information is lost during the feature extraction process. To compensate for this loss, concatenated feature maps are obtained by fusing the map generated in an up-sample layer with the one generated in an encoder layer. The concatenation doubles the number of channels from 64 to 128. In the decoder layers, there are five convolution layers. The first four consist of a convolution and a ReLU nonlinearity, and the final one uses only a convolution. The first convolution reduces the number of channels from 128 to 64, and the others generate 64-channel feature maps. The feature maps are then fed into the last convolutional layer to generate the same number of channels as the observed images. Rather than using the output of the last convolutional layer directly as the reconstructed image, the denoiser network predicts the residual, which has been shown to be more robust. Therefore, a shortcut connects the input to the reconstructed image.

Training process
5.1.1 Training dataset.
Observed images are obtained by applying different degradation operations. For denoising, Additive White Gaussian Noise (AWGN) with different noise levels is added to clean images to produce noisy images. For deblurring, blurry images are obtained by convolving clear images with different blur kernels and adding AWGN. The training dataset is the DIVerse 2K (DIV2K) resolution image dataset [38]. Each image is randomly cropped into 1000 images of size 128. During training, these inputs are cropped into patches of size 64. For data augmentation, the randomly cropped patches are flipped and rotated to generate a total of 250,000 patches.
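The cropping and flip/rotation steps can be sketched as follows for a single-channel image stored as a list of rows. The helper names and the choice of eight variants (four rotations, each with and without a horizontal flip) are assumptions made for illustration; the text does not specify which flips and rotations are used.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size patch at a random position from a 2-D image (list of rows)."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def rot90(patch):
    """Rotate a 2-D patch by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def flip_lr(patch):
    """Mirror a 2-D patch left to right."""
    return [row[::-1] for row in patch]


def augment(patch):
    """Return 8 variants: the 4 rotations, each with and without a horizontal flip."""
    variants = []
    current = [list(row) for row in patch]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_lr(current))
        current = rot90(current)
    return variants


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[r * 8 + c for c in range(8)] for r in range(8)]  # toy 8 x 8 "image"
patch = random_crop(image, 4, rng)
patches = augment(patch)
```

In the setting described above, the crop size would be 64 and the source images the 128-sized crops; the toy sizes here only keep the example quick to run.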

End-to-end training.
All DCNN denoisers share the same parameters to reduce the number of parameters and prevent overfitting. In our implementation, networks are trained using the Mean Square Error (MSE) loss function

$$\mathcal{L}(\Theta) = \frac{1}{N}\sum_{n=1}^{N}\|\mathcal{F}(y_n; \Theta) - x_n\|_2^2,$$

where $y_n$ and $x_n$ are the nth pair of degraded and clean image patches, N is the number of training pairs, and $\mathcal{F}(y_n; \Theta)$ is the proposed network with parameters Θ. The ADAM optimizer [39] is used to optimize the parameters. Convolutional kernels are initialized with the Xavier initializer developed in [40]. A warmup scheduler strategy is adopted for the learning rate.
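As a small illustration of the training objective, the sketch below computes an MSE loss over flattened patches, assuming the loss averages the squared L2 error over the N training pairs, together with a linear warmup schedule. The warmup shape, `base_lr`, and `warmup_steps` are hypothetical, since the text only states that a warmup strategy is used.

```python
def mse_loss(outputs, targets):
    """Average over the N patch pairs of the squared L2 distance between output and target.

    Each patch is given as a flat list of pixel values.
    """
    total = 0.0
    for out, tgt in zip(outputs, targets):
        total += sum((o - t) ** 2 for o, t in zip(out, tgt))
    return total / len(outputs)


def warmup_lr(step, base_lr=1e-4, warmup_steps=1000):
    """Hypothetical linear warmup: ramp up to base_lr over warmup_steps, then hold."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)


network_outputs = [[1.0, 2.0], [0.0, 0.0]]  # two toy 2-pixel patches
clean_patches = [[1.0, 0.0], [0.0, 1.0]]
loss = mse_loss(network_outputs, clean_patches)
```

In a PyTorch training loop the same roles would normally be played by a built-in loss module and a learning-rate scheduler; the plain functions above only spell out the arithmetic.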

Ablation study
To study the effect of the attention module in the U-Net, we compare models with and without CBAM. The comparative experiments are reported in Tables 1 and 2. DCNN_N denotes a DCNN denoiser without CBAM, and FIDNN_N denotes a fast iterative network without CBAM. When the noise level is high, the DCNN denoiser with the attention mechanism improves markedly. For the Gaussian deblurring experiments in Table 2, FIDNN increases the PSNR value by 0.25 dB. The attention module therefore effectively teaches the information flow which information needs to be emphasized or suppressed. In the following experiments, all methods proposed in this paper include the attention mechanism.
To verify the effectiveness of integrating degradation operations, we implement two types of experiments, i.e., with DCNN denoisers and with the iterative network FIDNN. Comparable trials

Denoising.
We compare our methods with several state-of-the-art denoising methods, including two model-based methods, i.e., BM3D [41] and EPLL [42], and three learning-based methods, i.e., TNRD, DPDNN, and IRCNN. Average PSNR results of the different methods on the widely used Set12 dataset [23] are shown in Table 3. Learning-based methods are superior to model-based methods. DPDNN greatly outperforms IRCNN and TNRD, while FIDNN performs better than DPDNN at higher noise levels. We also report denoising results on the Color Berkeley Segmentation Dataset (CBSD68) [43] and the Kodak24 dataset [44], as shown in Table 4. FIDNN outperforms the model-based method CBM3D [41] by 0.81 dB in average PSNR for noise level 50 on the CBSD68 dataset. Qualitative results on gray images for noise level 50 are shown in Fig 3. The DPDNN result shows spurious connections around edges, while the FIDNN result has better and smoother image details. Visual results on color images are shown in Figs 4 and 5. CBM3D over-smooths the image and fails to preserve edges. The three learning-based methods suffer from poor edge preservation for small, distant objects. In contrast, the proposed method recovers more complete textures and sharper edges.
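For reference, the PSNR values reported in the tables follow the standard definition, PSNR = 10 log10(peak^2 / MSE). The sketch below computes it for flattened images; the peak value of 255 assumes 8-bit images, which is an assumption of this example rather than a detail stated in the text.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


reference = [10.0, 20.0, 30.0, 40.0]
estimate = [11.0, 21.0, 31.0, 41.0]  # uniform error of one gray level
value = psnr(reference, estimate)
```

Since the scale is logarithmic, a gain of a few tenths of a dB, such as the differences discussed above, corresponds to a reduction of the mean squared error by several percent.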

Deblurring.
Deblurring experiments with non-linear blur kernels are carried out to further confirm the wide applicability of the proposed methods, as shown in Tables 5 and 6. The blur kernels include Gaussian blur of size 25 with standard deviations of 1.2 and 1.6, and the motion blur kernels of size 19 and 17 from [45]. For Gaussian deblurring, AWGN with noise level 2 is added to the blurred images. For motion deblurring, AWGN with noise level 7.65 is added. The model-based method IDDBM3D [46] and four learning-based methods, including IRCNN, IRCNN+ [13], DPIR [13], and DPDNN, are compared with our methods on the widely used Set10 dataset [19]. IRCNN+ refers to the method of [13] in which the denoiser sub-network is replaced with IRCNN. Model-based methods perform poorly on Gaussian blur. Compared to the same iteration-based method, i.e., DPDNN, 0.12 gains of average PSNR for

Convergence analysis
The quality of detail restoration is likely to be affected by the trend of parameter variation. Under the same configuration, the step size of DPDNN shows a downward trend and the penalty parameter shows an upward trend in Fig 8a. This is consistent with the meaning of the parameters described in this paper. However, DPDNN fluctuates slightly in later iterations. Its unstable noise variance and blur composition run counter to iterative solutions. In early stages, the parameters of FIDNN change rapidly. Correspondingly, degraded images become clearer quickly, as shown in  performing results. IRCNN+ converges quickly in early stages, but its stability is poor. In contrast, FIDNN maintains fast and stable convergence with a lower number of iterations.

Model complexity and runtime
Under the same hardware, we measure model complexity and testing runtime for several deep learning methods, as shown in Table 7. IRCNN has the lowest model FLOPs and parameter count. With the same U-Net denoiser, FIDNN_N achieves a better average PSNR than DPDNN at higher noise levels. The numerous convolution parameters of FIDNN result in a longer testing runtime. It is worth mentioning that the per-image runtime of FIDNN is distinctly lower than that of IDNN.

PLOS ONE

Conclusion
This work links the variational models of model-based methods to learnable deep learning approaches. First, the proximal operator is used to implement the Taylor expansion linearization in the energy minimization of a variational function. The proximal gradient descent algorithm is unrolled into the IDNN model with the proposed U-Net denoiser and trained end-to-end. The attention mechanism incorporated into the denoiser sub-network effectively learns which image information to emphasize or suppress. Furthermore, by introducing a momentum factor that drives the reconstruction to keep iterating with inertia, IDNN is extended to the fast IDNN (FIDNN), which speeds up convergence without stronger conditions.
Learning the parameters end-to-end effectively reduces the cost of manual tuning. Moreover, the proposed iterative solution with trainable parameters expresses the dynamic characteristics of image reconstruction better than constant parameters. The experimental results show that FIDNN, with fewer iterations, converges faster and more stably at test time than several iteration-based unfolding methods. Owing to the wide applicability of the proposed models, more computer vision tasks can be addressed in the future by handling different degradation operations simultaneously.
The experiments are implemented in the PyTorch framework on a PC with an Intel Core i7-11700K CPU and an Nvidia RTX 3090 GPU.
https://doi.org/10.1371/journal.pone.0276373.t007